ABSTRACT

Reinforcement learning with reward shaping is a well-established
but often computationally expensive approach to multiagent
problems. Agent partitioning can reduce this computational cost
by treating each partition of agents as an independent problem.
We introduce a novel agent partitioning approach called
Reward/Utility-Based Impact (RUBI). RUBI finds an effective
partitioning of agents while requiring no prior domain knowledge,
provides better performance by discovering a non-trivial agent
partitioning, and leads to faster simulations. We test RUBI in the
Air Traffic Flow Management Problem (ATFMP), where tens of
thousands of aircraft simultaneously affect the system and there
is no intuitive similarity metric between agents. When partitioning
with RUBI in the ATFMP, there is a 37% increase in performance,
with a 510x speedup per simulation step over non-partitioning
approaches.

I.2.11 [Distributed Artificial Intelligence]: Intelligent

1. INTRODUCTION

Two key elements in a multiagent reinforcement learning system
are minimizing computation time and maximizing coordination.
Reward shaping is a field in multiagent reinforcement learning
that focuses on the design of rewards, and has been shown to
assist in multiagent coordination. This reward shaping typically
comes at a large cost in computation time, and in large, highly
coupled domains it quickly becomes computationally intractable.

Partitioning agents into hierarchies [3] or teams [2] speeds up
computation for extremely large domains (approximately
10,000-40,000 agents) while still using the reward shaping
technique without approximation error. In this paper we introduce
Reward/Utility-Based Impact (RUBI) scores. RUBI partitions agents
by determining the effect of one agent’s action on another agent’s
reward. Using this measure, it develops similarity metrics between
all agents in order to partition them usefully and thereby reduce
the complexity of the learning problem. In contrast to many other
partitioning approaches, this has the advantage of requiring no
domain knowledge.

The contributions of this work are: Generality: RUBI requires no
prior knowledge of the domain, essentially treating the domain as
a black box from which to obtain rewards. Usability: RUBI removes
the need to derive similarity metrics, and with it the need for a
domain expert in situations where one isn’t available.
Performance: RUBI discovers non-trivial agent partitionings by
using a reward function to partition agents. Speed: RUBI leads to
a larger number of partitions without losing performance, leading
to more independence and therefore faster simulations.

2. RUBI

In this work we introduce an autonomous partitioning algorithm
requiring no domain knowledge, the Reward/Utility-Based Impact
algorithm. We develop an initial agent similarity matrix that uses
no knowledge about the domain and groups agents together based on
the impact of one agent on another. This matrix can then be used
as an input to a hierarchical agglomerative clustering algorithm.

If one agent’s action heavily impacts another agent’s reward,
those agents are coupled enough to be partitioned together. The
RUBI algorithm computes a localized reward for each agent with
agent j in the system, and then compares that reward to the
localized reward for each agent if agent j is not in the system.
This partitioning algorithm is based around the central idea:
$|L_i(z) - L_i(z - z_j)| > |L_k(z) - L_k(z - z_j)| \Rightarrow \mathrm{SIM}(i,j) > \mathrm{SIM}(k,j)$,
where $L_i(z - z_j)$ is the localized reward of agent i if j is not
in the system, $L_k(z - z_j)$ is the localized reward of agent k if
j is not in the system, and $L_i(z)$ and $L_k(z)$ are the localized
rewards of i and k when all agents are in the system. This means
that if the localized reward of agent i changes more than the
localized reward of agent k when agent j is taken out of the
system, agent j has more effect on agent i than on agent k.

Implementation: The RUBI algorithm first initializes an N x N
matrix C, where N is the number of agents within the system. It
then calculates actions based on the ACT() function, which is
typically random action selection. A simulation is then run with
all of the agents in the system and the localized reward is
calculated for every agent. We then remove an agent from the
system, recalculate the reward for each agent (since this is a
localized reward, this is typically a fast operation), and update
the impact table C. The agent is then added back to the system and
the procedure is repeated for each of the other agents.

Algorithm 1 Reward/Utility Based Impact Algorithm

 1: function RUBI(sim)
 2:     C ← N × N
 3:     for iter ← 1 to iterations do
 4:         actions ← ACT()
 5:         sim.run(actions)
 6:         L(z) ← sim.getRewards()
 7:         for r ← 1 to N do
 8:             sim.removeAgent(r)
 9:             L(z - z_r) ← sim.getRewards()
10:             for a ← 1 to N do
11:                 C_{r,a} ← C_{r,a} + |L_a(z) - L_a(z - z_r)|
12:             end for
13:             sim.addAgent(r)
14:         end for
15:     end for
16: end function
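
As an illustration only, Algorithm 1 might be sketched in Python as follows. The `ToySim` class is a hypothetical stand-in for a real simulator (it is not the ATFMP simulator), its congestion-style localized reward is an assumption made for the example, and the method names simply mirror the pseudocode. C is assumed to start at zero so that it can accumulate.

```python
import random


class ToySim:
    """Hypothetical simulator: each agent picks a slot, and its localized
    reward is minus the number of other active agents in the same group
    that picked the same slot."""

    def __init__(self, groups):
        self.groups = groups                  # groups[i]: resource pool of agent i
        self.n = len(groups)
        self.active = [True] * self.n
        self.actions = [0] * self.n

    def run(self, actions):
        self.actions = list(actions)

    def remove_agent(self, r):
        self.active[r] = False

    def add_agent(self, r):
        self.active[r] = True

    def get_rewards(self):
        rewards = []
        for i in range(self.n):
            if not self.active[i]:
                rewards.append(0.0)           # assumption: a removed agent scores 0
                continue
            crowd = sum(
                1 for j in range(self.n)
                if j != i and self.active[j]
                and self.groups[j] == self.groups[i]
                and self.actions[j] == self.actions[i]
            )
            rewards.append(-float(crowd))
        return rewards


def rubi(sim, n_agents, act, iterations):
    """Accumulate the impact table C following the loop structure of Algorithm 1."""
    c = [[0.0] * n_agents for _ in range(n_agents)]   # assumed zero-initialized
    for _ in range(iterations):
        sim.run(act())
        full = sim.get_rewards()                      # L(z)
        for r in range(n_agents):
            sim.remove_agent(r)
            without_r = sim.get_rewards()             # L(z - z_r)
            for a in range(n_agents):
                # Line 11: impact of removing agent r on agent a.
                # The diagonal (a == r) is self-impact and is ignored here.
                c[r][a] += abs(full[a] - without_r[a])
            sim.add_agent(r)
    return c


random.seed(0)
groups = [0, 0, 0, 1, 1, 1]
sim = ToySim(groups)
impact = rubi(sim, len(groups),
              act=lambda: [random.randrange(2) for _ in groups],
              iterations=200)
```

In this toy setting, agents in different groups never share a resource, so the corresponding off-diagonal entries of `impact` stay at zero, while entries for agents in the same group grow with the number of iterations.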


Reward/Utility-Based Impact: The impact data used to compute the
similarity matrix is obtained from a localized reward or utility
with respect to an agent. Learning in congestion problems with
local rewards typically leads to a poor solution, as agents
following local rewards do not optimize the system-level reward.
In RUBI, however, we do not want to learn but to analyze the local
impact one agent has on another, so local rewards are an ideal
choice.

RUBI is simply an accumulation of impact scores (line 11 of
Algorithm 1). Given enough iterations, this accumulation is
informative enough to perform accurate partitioning. In this
research we are interested in the relative impact score from one
agent to another more than in the explicit value of the impact
score.
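
To make the step from accumulated impact to partitions concrete, the following is a minimal sketch, not the authors' implementation. It assumes the impact table is symmetrized and scaled to [0, 1] to obtain a similarity matrix, and it uses a plain average-linkage agglomerative clustering with a fixed target number of clusters. The normalization, the linkage rule, the stopping criterion, and the function names `to_similarity` and `agglomerate` are all assumptions made for illustration.

```python
def to_similarity(impact):
    """Symmetrize an accumulated impact table and scale it to [0, 1].

    Only relative magnitudes matter, so dividing by the largest entry
    keeps the ordering of agent pairs unchanged.
    """
    n = len(impact)
    sym = [[0.0 if i == j else impact[i][j] + impact[j][i] for j in range(n)]
           for i in range(n)]
    peak = max(max(row) for row in sym)
    if peak == 0.0:
        return sym
    return [[value / peak for value in row] for row in sym]


def agglomerate(similarity, n_clusters):
    """Average-linkage agglomerative clustering on a similarity matrix."""
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a, b):
        # Mean pairwise similarity between two clusters.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = linkage(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge the most similar pair
        del clusters[y]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical impact table for five agents.
impact = [
    [0.0, 9.0, 8.0, 0.0, 0.5],
    [9.0, 0.0, 7.0, 0.0, 0.0],
    [8.0, 7.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 6.0],
    [0.5, 0.0, 0.0, 6.0, 0.0],
]
partitions = agglomerate(to_similarity(impact), n_clusters=2)
print(partitions)  # [[0, 1, 2], [3, 4]]
```

For large N a library clustering routine would be the practical choice. The quadratic search above is only meant to show the idea.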

Benefits of RUBI: One of the key strengths of RUBI is its
simplicity and generality, combined with highly informative
similarity scores that lead to well-performing partitions. It
needs no prior knowledge about the domain to perform partitioning,
and simply needs a localized reward from each agent to build the
similarity matrix. This makes RUBI highly generic and applicable
to any multiagent domain in which a localized reward is available.

Additionally, partitions built using RUBI are likely to be greater
in number without loss of performance. Domain-based partitioning
based on agent similarity encodes how often two agents impact each
other. RUBI, on the other hand, looks at how the actions of one
agent impact another agent’s reward. If over a few thousand trials
the reward impact of two agents is always 0, those agents’ actions
never impact each other’s reward. The same is true if the reward
impact is always the same non-zero value: the actions do not
affect the reward, and therefore the agents are not partitioned
together. This is a key feature of RUBI, and leads to finding more
partitions without loss of performance.
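
One way to read the zero-impact case is as a graph problem: if agents are nodes and an edge joins any pair with non-zero accumulated impact, then groups with no edges between them never affect each other's reward. The sketch below finds those groups as connected components. This is an illustrative interpretation rather than a procedure stated in the paper. It covers only the always-zero case (not the constant non-zero case), and the function name and the `tol` threshold are hypothetical.

```python
def reward_independent_partitions(impact, tol=0.0):
    """Group agents so that no agent impacts the reward of an agent in
    another group (connected components of the non-zero-impact graph)."""
    n = len(impact)
    seen = [False] * n
    partitions = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                # Treat coupling in either direction as a link.
                coupled = impact[i][j] > tol or impact[j][i] > tol
                if j != i and coupled and not seen[j]:
                    seen[j] = True
                    stack.append(j)
        partitions.append(sorted(component))
    return partitions


# Hypothetical impact table: agents 0 and 1 are coupled, 2 and 3 are isolated.
impact = [
    [0.0, 3.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
parts = reward_independent_partitions(impact)
print(parts)  # [[0, 1], [2], [3]]
```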

3. RUBI PERFORMANCE IN THE ATFMP

We test RUBI in the Air Traffic Flow Management Problem (ATFMP).
The approach used here is the same as in Curran et al. [2] and
Agogino and Rios [1, 4], except that agents are partitioned using
RUBI.

The system-level reward in the ATFMP focuses on the cumulative
delay ($\delta$) and congestion ($C$) throughout the system:
$G(z) = -(C(z) + \delta(z))$. Agogino and Rios originally had the
idea of adding a greedy scheduler to algorithmically remove
congestion from the system, while simultaneously using learning to
minimize delay. We follow this approach, and therefore our
system-level reward is simply $G(z) = -\delta(z)$.

Figure 1: As the number of partitions decreases, performance
improves while time complexity increases.

With so many agents, tens of thousands of actions simultaneously
impact the system, causing the reward for a specific agent to
become noisy with the actions of other agents. A difference reward
shaping function reduces much of this noise, and is easily derived
from the system-level reward:
$D_i(z) = \delta(z - z_i + c_i) - \delta(z)$, where
$\delta(z - z_i + c_i)$ is the cumulative delay of all agents with
agent i replaced with counterfactual $c_i$. Without RUBI, this
reward shaping technique makes the ATFMP computationally
intractable.
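
The difference reward can be sketched in a few lines. The delay model below is a hypothetical toy (slot indices as ground delay plus one unit per agent over a slot's capacity), not the ATFMP delay computation, and the counterfactual value passed in is likewise just an example. Only the formula $D_i(z) = \delta(z - z_i + c_i) - \delta(z)$ is taken from the text.

```python
def toy_delay(z, capacity=1):
    """Hypothetical cumulative delay delta(z).

    Each entry of z is the slot an agent departs in. An agent incurs its
    slot index as ground delay, and every agent beyond a slot's capacity
    adds one further unit of delay.
    """
    load = {}
    for slot in z:
        load[slot] = load.get(slot, 0) + 1
    overflow = sum(max(0, count - capacity) for count in load.values())
    return float(sum(z) + overflow)


def difference_reward(delta, z, i, counterfactual):
    """D_i(z) = delta(z - z_i + c_i) - delta(z)."""
    z_cf = list(z)
    z_cf[i] = counterfactual          # replace agent i's action with c_i
    return delta(z_cf) - delta(z)


z = [0, 0, 1]                         # agents 0 and 1 collide in slot 0
d0 = difference_reward(toy_delay, z, i=0, counterfactual=2)
print(d0)  # 1.0
```

Here delta(z) is 2 and the counterfactual delay is 3, so agent 0's actual action is credited with one unit of avoided delay. Each evaluation requires recomputing the delay with one agent's action replaced, which is where the cost grows when delta must be evaluated over tens of thousands of agents.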

Partitions developed using RUBI use similarity metrics that
encapsulate the agent coupling. Partitioning with RUBI and the
difference reward outperformed the greedy scheduler. Figure 1
shows a variety of partitionings outperforming the greedy
scheduler. The key benefit of RUBI-based partitioning was that a
reward-independent partitioning involved 61 partitions, whereas in
domain-based partitioning the smallest was 3. This leads to a 10%
faster processing time, at no cost to performance and using no
domain knowledge.